Efficient Reinforcement Learning Using Recursive Least-Squares Methods
نویسندگان
چکیده
The recursive least-squares (RLS) algorithm is one of the most well-known algorithms used in adaptive filtering, system identification and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered to be optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, where two new reinforcement learning algorithms using linear value function approximators are proposed and analyzed. The two algorithms are called RLS-TD( λ ) and Fast-AHC (Fast Adaptive Heuristic Critic), respectively. RLS-TD( λ ) can be viewed as the extension of RLS-TD(0) from λ =0 to general 0≤ λ ≤1, so it is a multi-step temporal-difference (TD) learning algorithm using RLS methods. The convergence with probability one and the limit of convergence of RLS-TD( λ ) are proved for ergodic Markov chains. Compared to the existing LS-TD( λ ) algorithm, RLS-TD( λ ) has advantages in computation and is more suitable for online learning. The effectiveness of RLS-TD( λ ) is analyzed and verified by learning prediction experiments of Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD( λ ) algorithm in the critic network of the adaptive heuristic critic method. Unlike conventional AHC algorithm, Fast-AHC makes use of RLS methods to improve the learning-prediction efficiency in the critic. Learning control experiments of the cart-pole balancing and the acrobot swing-up problems are conducted to compare the data efficiency of Fast-AHC with conventional AHC. From the experimental results, it is shown that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD( λ ). Furthermore, it is demonstrated in the experiments that different initial values of the variance matrix in RLS-TD( λ ) are required to get better performance not only in learning prediction but also in learning control. The experimental results are analyzed based on the existing theoretical work on the transient phase of forgetting factor RLS methods.
منابع مشابه
A general fuzzified CMAC based reinforcement learning control for ship steering using recursive least-squares algorithm
Recursive least-squares temporal difference algorithm (RLS-TD) is deduced, which can use data more efficiently with fast convergence and less computational burden. Reinforcement learning based on recursive least-squares methods is applied to ship steering control, as provides an efficient way for the improvement of ship steering control performance. It removes the defect that the conventional i...
متن کاملAn RLS-Based Natural Actor-Critic Algorithm for Locomotion of a Two-Linked Robot Arm
Recently, actor-critic methods have drawn much interests in the area of reinforcement learning, and several algorithms have been studied along the line of the actor-critic strategy. This paper studies an actor-critic type algorithm utilizing the RLS(recursive least-squares) method, which is one of the most efficient techniques for adaptive signal processing, together with natural policy gradien...
متن کاملSustainable ℓ2-regularized actor-critic based on recursive least-squares temporal difference learning
Least-squares temporal difference learning (LSTD) has been used mainly for improving the data efficiency of the critic in actor-critic (AC). However, convergence analysis of the resulted algorithms is difficult when policy is changing. In this paper, a new AC method is proposed based on LSTD under discount criterion. The method comprises two components as the contribution: (1) LSTD works in an ...
متن کاملLearning RoboCup-Keepaway with Kernels
We apply kernel-based methods to solve the difficult reinforcement learning problem of 3vs2 keepaway in RoboCup simulated soccer. Key challenges in keepaway are the highdimensionality of the state space (rendering conventional discretization-based function approximation like tilecoding infeasible), the stochasticity due to noise and multiple learning agents needing to cooperate (meaning that th...
متن کاملAn Algorithmic Survey of Parametric Value Function Approximation
Reinforcement learning is a machine learning answer to the optimal control problem. It consists in learning an optimal control policy through interactions with the system to be controlled, the quality of this policy being quantified by the so-called value function. A recurrent subtopic of reinforcement learning is to compute an approximation of this value function when the system is too large f...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- J. Artif. Intell. Res.
دوره 16 شماره
صفحات -
تاریخ انتشار 2002